1. Introduction

MAVEQC is a flexible R package that provides QC analysis of Saturation Genome Editing (SGE) experimental data. It is available under GPL 3.0 from https://github.com/wtsi-hgi/MAVEQC


2. Screen QC

Displays QC plots and statistics for all samples.


2.1. Sample Sheet


2.2. Run Sample QC

2.2.1. Read Length Distribution

Displays the percentage of reads for each sample in 50-nucleotide length increments, calculated against the total number of raw reads.

Note: the expected read length is 300 nucleotides.

Note: the lengths of primers are deducted from the read length based on the sample sheet information (see 2.1. Sample Sheet: quants_append_start and quants_append_end).

Pass criterion: more than 90% of reads are longer than 200 nucleotides
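
As a rough illustration of this check, the sketch below bins read lengths in 50-nucleotide increments and applies the pass criterion. It is written in Python rather than taken from MAVEQC's R code; the function names and the way bins are labelled by their start position are assumptions made for illustration, and the lengths are assumed to have had primer lengths deducted already.

```python
from collections import Counter

def length_distribution(read_lengths, bin_width=50):
    """Return {bin_start: percentage of reads} for bins of `bin_width` nucleotides."""
    total = len(read_lengths)
    # Assumed binning: a read of length n falls in the bin that starts at
    # floor(n / bin_width) * bin_width, e.g. 0-49, 50-99, ...
    bins = Counter((n // bin_width) * bin_width for n in read_lengths)
    return {start: 100.0 * count / total for start, count in sorted(bins.items())}

def passes_read_length(read_lengths, min_length=200, min_percent=90.0):
    """Pass when more than `min_percent` of reads are longer than `min_length`."""
    long_reads = sum(1 for n in read_lengths if n > min_length)
    return 100.0 * long_reads / len(read_lengths) > min_percent

# Toy data, not real MAVEQC output.
lengths = [300] * 95 + [120] * 5
distribution = length_distribution(lengths)
sample_passes = passes_read_length(lengths)
```

Both comparisons are strict, matching the wording "more than 90%" and "longer than 200 nucleotides".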


2.2.2. Missing Variants

Stats of missing variants in the library

Pass criterion: less than 1% of expected variants are missing
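
A minimal sketch of how this statistic could be computed, assuming a missing variant is an expected library sequence with a read count of 0. The function names and the dictionary-based input are hypothetical and do not reflect MAVEQC's internal data structures.

```python
def missing_variants(expected_sequences, counts):
    """Expected sequences with no reads (absent from `counts` or a count of 0)."""
    return [seq for seq in expected_sequences if counts.get(seq, 0) == 0]

def missing_variant_check(expected_sequences, counts, max_percent=1.0):
    """Return (passes, percent_missing); passes when strictly below `max_percent`."""
    missing = missing_variants(expected_sequences, counts)
    percent = 100.0 * len(missing) / len(expected_sequences)
    return percent < max_percent, percent

# Toy library of 200 expected sequences with one sequence never observed.
expected = [f"seq{i}" for i in range(200)]
counts = {seq: 10 for seq in expected}
counts["seq7"] = 0
passes, percent_missing = missing_variant_check(expected, counts)
```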


Records of missing variants in the library

Note: Unique indicates whether a template sequence occurs only once in the VaLiAnT meta file (1: Unique, 0: Not Unique). This matters because a template sequence can occur more than once, depending on the mutation types applied in VaLiAnT.

Note: The table below shows the missing variants for all samples, so a variant may appear multiple times.


2.2.3. Total Reads (Counts)

Displays the total number of reads per sample. Reads are filtered using 1-dimensional k-means clustering, which excludes unique sequences with low read counts.

  • Accepted reads: Total read count for all unique sequences with sufficient reads based on 1D k-means clustering.
  • Excluded reads: Total read count for all unique sequences with insufficient reads based on 1D k-means clustering.

Total Reads: the total number of raw reads

Pass criterion: more than 1,000,000 total reads


2.2.4. Accepted Reads (Percentage)

Displays the percentage of library reads versus non-library reads (i.e. Reference, PAM and Unmapped) among Accepted Reads (see 2.2.3).

  • Library Reads: Percentage reads mapping to template oligo sequences, including intended variants.
  • Reference Reads: Percentage reads mapping to Reference.
  • PAM Reads: Percentage reads mapping to PAM/Protospacer Protection Edits (PPEs), without intended variant.
  • Unmapped Reads: Percentage of reads not mapped to library sequences, the PAM sequence, or the reference sequence.
  • Library Coverage: Mean read count per template oligo sequence.

Pass criterion: more than 40% of accepted reads are library reads

Note: Accepted reads are the filtered reads based on 2.2.3

Library coverage is the mean read count per template oligo sequence: the total number of library reads divided by the total number of library sequences.

Pass criterion: library coverage is more than 100 reads
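
The two criteria in this section reduce to simple ratios. The sketch below assumes that the four read categories (Library, Reference, PAM, Unmapped) together make up the accepted reads; the function name and the example numbers are hypothetical.

```python
def library_metrics(library_reads, reference_reads, pam_reads, unmapped_reads,
                    n_library_sequences):
    """Percentage of library reads among accepted reads, and library coverage."""
    # Assumption: accepted reads are the sum of the four categories.
    accepted = library_reads + reference_reads + pam_reads + unmapped_reads
    percent_library = 100.0 * library_reads / accepted
    # Library coverage: mean read count per template oligo sequence.
    coverage = library_reads / n_library_sequences
    return {
        "percent_library": percent_library,
        "library_coverage": coverage,
        "pass_percent_library": percent_library > 40.0,
        "pass_library_coverage": coverage > 100.0,
    }

# Toy numbers, not real MAVEQC output.
metrics = library_metrics(
    library_reads=600_000,
    reference_reads=200_000,
    pam_reads=100_000,
    unmapped_reads=100_000,
    n_library_sequences=3_000,
)
```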


2.2.5. Genomic Coverage

Distribution of variants across the targeton region based on log2(count+1) values.

Note: Does not show missing variants (0 count in the library).

Low Abundance cutoff: the green dashed line marks the threshold below which a variant is considered low abundance (less than 5 reads)

Pass criterion: the percentage of low-abundance variants is lower than 30%

% Low Abundance: the percentage of variants below the low abundance cutoff
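
The sketch below shows the log2(count+1) transform and the low-abundance percentage under one reading of the text: a variant is low abundance when its raw count is less than 5. Whether missing variants (0 count) enter the denominator is not stated here, so the function simply uses whatever counts it is given; all names are hypothetical.

```python
import math

LOW_ABUNDANCE_READS = 5  # cutoff from the text: less than 5 reads

def log2_counts(counts):
    """Transform raw counts to log2(count + 1), as plotted in this section."""
    return [math.log2(c + 1) for c in counts]

def percent_low_abundance(counts, cutoff=LOW_ABUNDANCE_READS):
    """Percentage of variants whose raw count is below the cutoff."""
    low = sum(1 for c in counts if c < cutoff)
    return 100.0 * low / len(counts)

# Toy counts for ten variants, not real MAVEQC output.
counts = [1, 3, 7, 15, 31, 63, 127, 255, 511, 1023]
values = log2_counts(counts)
pct_low = percent_low_abundance(counts)
sample_passes = pct_low < 30.0
```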


2.2.6. Genomic Position Percentage

Displays the distribution of “LOF” (loss-of-function) vs all “Other” variants across the targeton region, based on read percentages for the reference timepoint. Requires concordant distribution of LOF and Other variants.

Note: Does not show missing variants (0 count in the library).


2.3. Run Experiment QC

2.3.1. Sample Correlations



2.3.2. Sample PCA


2.3.3. Fold Change (by category)

condition_Day7_vs_Day4

condition_Day15_vs_Day4


2.3.4. Fold Change (by position)

condition_Day7_vs_Day4

condition_Day15_vs_Day4



3. QC Results

Summary of the final results. The cutoffs used for PASS/FAIL are listed below.


3.1. Sample QC Results


4. Methods and Glossary

4.1. Methods

4.1.1. Methods of generating accepted reads (refer to 2.2.3)

  1. Apply 1D k-means clustering to the log2 read counts of the sequences (variant sequences). Sequences with low read counts are removed in this step.
  2. A valid sequence must have a count of at least 5 in at least 25% of the samples in an experiment.
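
The two steps above can be sketched as follows. This is an illustration under several assumptions that the text does not confirm: two clusters (k = 2), clustering on log2(total count + 1) per sequence summed over samples, and cluster centres initialised at the minimum and maximum values. The function names are hypothetical and this is not MAVEQC's implementation.

```python
import math

def kmeans_1d_two_clusters(values, max_iter=100):
    """Lloyd's algorithm with k=2 on scalars; returns (low_centre, high_centre)."""
    low, high = min(values), max(values)
    for _ in range(max_iter):
        # Assign each value to the nearer centre (ties go to the low cluster).
        low_members = [v for v in values if abs(v - low) <= abs(v - high)]
        high_members = [v for v in values if abs(v - low) > abs(v - high)]
        new_low = sum(low_members) / len(low_members)
        new_high = sum(high_members) / len(high_members) if high_members else high
        if new_low == low and new_high == high:
            break  # centres stopped moving
        low, high = new_low, new_high
    return low, high

def accepted_sequences(count_table, min_count=5, min_fraction=0.25):
    """count_table maps sequence -> list of read counts, one per sample."""
    log_totals = {seq: math.log2(sum(c) + 1) for seq, c in count_table.items()}
    low, high = kmeans_1d_two_clusters(list(log_totals.values()))
    kept = []
    for seq, counts in count_table.items():
        # Step 1: keep sequences that fall in the high-count cluster.
        in_high_cluster = abs(log_totals[seq] - high) < abs(log_totals[seq] - low)
        # Step 2: at least `min_count` reads in at least 25% of samples.
        n_supporting = sum(1 for c in counts if c >= min_count)
        enough_samples = n_supporting >= min_fraction * len(counts)
        if in_high_cluster and enough_samples:
            kept.append(seq)
    return kept

# Toy table with five samples per sequence, not real MAVEQC output.
table = {
    "A": [120, 98, 110, 105, 90],
    "B": [80, 95, 70, 88, 76],
    "C": [1, 0, 2, 0, 1],
    "D": [0, 1, 0, 0, 0],
    "E": [400, 0, 0, 0, 0],
}
kept = accepted_sequences(table)
```

In the toy table, C and D fall into the low-count cluster, while E has a high total but reaches 5 reads in only one of five samples, so only A and B are retained.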

4.1.2. Methods of DESeq2 calculation (refer to 2.3.3 and 2.3.4)

  1. Use the total number of accepted reads to calculate the size factor that is applied in DESeq2 normalisation.
  2. Run DESeq2 for each consequence.
  3. Select synonymous and intronic variants as the controls, then calculate the median log2 fold change of the control variants.
  4. Recalculate the log2 fold change of the other consequences by subtracting the median log2 fold change of the control variants.
  5. Recalculate the p-value and the adjusted p-value.
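
Steps 3 and 4 amount to centring the log2 fold changes on the control median. The sketch below illustrates that idea on plain Python dictionaries; the consequence labels, field layout and function name are assumptions, and the subtraction is applied to every variant here for simplicity.

```python
from statistics import median

# Assumed labels for the control consequences.
CONTROL_CONSEQUENCES = {"synonymous", "intronic"}

def centre_on_controls(variants):
    """Subtract the median control log2 fold change from each variant.

    `variants` is a list of dicts with 'consequence' and 'log2FoldChange'.
    Returns (median_control_value, variants with 'adj_log2FoldChange' added).
    """
    control = [v["log2FoldChange"] for v in variants
               if v["consequence"] in CONTROL_CONSEQUENCES]
    median_control = median(control)
    adjusted = [dict(v, adj_log2FoldChange=v["log2FoldChange"] - median_control)
                for v in variants]
    return median_control, adjusted

# Toy values, not real DESeq2 output.
variants = [
    {"consequence": "synonymous", "log2FoldChange": 0.25},
    {"consequence": "synonymous", "log2FoldChange": 0.75},
    {"consequence": "intronic", "log2FoldChange": 0.5},
    {"consequence": "missense", "log2FoldChange": -1.5},
    {"consequence": "stop_gained", "log2FoldChange": -2.5},
]
median_control, adjusted = centre_on_controls(variants)
```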

4.2. Glossary

4.2.1. Glossary of DESeq2 calculation (refer to 2.3.3 and 2.3.4)

name                  description
log2FoldChange        This is the initial log2FoldChange from DESeq2 using all the accepted reads
lfcSE                 This is the initial lfcSE (log2FoldChange Standard Error) from DESeq2 using all the accepted reads
padj                  This is the initial corrected p-value from DESeq2 using all the accepted reads
median control value  This is the median log2 fold change of the control variants (synonymous variants and intronic variants)
adj_log2FoldChange    This is the adjusted log2FoldChange calculated by deducting the median control value
adj_score             This is the adjusted score that is calculated from the adj_log2FoldChange divided by the lfcSE
adj_pval              This is the adjusted p-value derived from adj_score
adj_fdr               This is the adjusted FDR derived from adj_pval
stat                  This indicates an enriched or a depleted status. adj_fdr < 0.05 & adj_log2FoldChange > 0 is enriched, adj_fdr < 0.05 & adj_log2FoldChange < 0 is depleted
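
The chain from adj_log2FoldChange to stat can be sketched as below. The glossary does not say how adj_pval is derived from adj_score or which FDR procedure is used, so this example assumes a two-sided p-value from the standard normal distribution and the Benjamini-Hochberg procedure. The label "unchanged" for variants that are neither enriched nor depleted is also hypothetical.

```python
import math

def two_sided_normal_p(score):
    """Two-sided p-value of a score under a standard normal (assumed test)."""
    return math.erfc(abs(score) / math.sqrt(2))

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    fdr = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        fdr[i] = running_min
    return fdr

def classify(adj_log2fc, lfc_se, alpha=0.05):
    """Return a list of (adj_score, adj_pval, adj_fdr, stat) tuples."""
    scores = [lfc / se for lfc, se in zip(adj_log2fc, lfc_se)]
    pvals = [two_sided_normal_p(z) for z in scores]
    fdrs = benjamini_hochberg(pvals)
    results = []
    for lfc, z, p, q in zip(adj_log2fc, scores, pvals, fdrs):
        if q < alpha and lfc > 0:
            stat = "enriched"
        elif q < alpha and lfc < 0:
            stat = "depleted"
        else:
            stat = "unchanged"  # hypothetical label, not from the glossary
        results.append((z, p, q, stat))
    return results

# Toy values, not real MAVEQC output.
results = classify(adj_log2fc=[-3.0, 2.0, 0.1], lfc_se=[0.5, 0.4, 0.5])
stats = [r[3] for r in results]
```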